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ABSTRACT 

Motivation: Most models of genome evolution integrating gene 
duplications, losses and chromosomal rearrangements are computa- 
tionally intractable, even when comparing only two genomes. This 
prevents large-scale studies that consider different types of genome 
structural variations. 

Results: We define an 'adjacency phylogenetic tree' that describes 
the evolution of an adjacency, a neighborhood relation between two 
genes, by speciation, duplication or loss of one or both genes, 
and rearrangement. We describe an algorithm that, given a species 
tree and a set of gene trees where the leaves are connected 
by adjacencies, computes an adjacency forest that minimizes 
the number of gains and breakages of adjacencies (caused by 
rearrangements) and runs in polynomial time. We use this algorithm 
to reconstruct contiguous regions of mammalian and plant ancestral 
genomes in a few minutes for a dozen species and several thousand 
genes. We show that this method yields reduced conflict between 
ancestral adjacencies. We detect duplications involving several 
genes and compare the different modes of evolution between phyla 
and among lineages. 

Availability: C++ implementation using BIO++ package, available 
upon request to Severine Berard. 

Contact: ISeverine.Berard@cirad.frorEric.Tannier@inria.frl 
Supplementary information: [Supplementary material s available at 
Bioinformatics online. 



1 INTRODUCTION 

A phylogenetic tree describes the kin relationships between a set 
of homologous objects. Non-homologous objects may have other 
types of relationships, such as interactions, functional relationships, 
co-expression or neighborhood between genes. Studying the 
pattern of descent of these relationships can be used to define 
homology between them, reconstruct ancestral relationships and 
build phylogenetic trees. 

The evolution of gene proximity or interaction has been the 
subject of numerous recent studies. It is for example a way to assess 
co-evolution between genes, even if often co-evolution is detected 
by searching for similarities in gene trees, but without modeling 
expli citly the relation that make the genes co-evolve ( Rodionov 
et a/.. l201ll : lTuller et a/.Ll2010l). 

Clo ser to our study, jPinnev et all I2007I) and (D utkowski and 
Tiurvn j2009h or dlvla et a/.L I2008I) propose methods to reconstruct 



ancestral protein-protein interactions or gene neighborhoods based 
on a model of evolution allowing gene duplications. They, however, 
assume that the chronology o f dupli cations is known, which often 
is not the case. jPatro et all l201ll) define a general problem of 
network evolution without this assumption and give a heuristic 
solution for the comparison of two species. Our model considers 
the more specifi c problem of gene n eighborhoods on chromosomes, 
but generalizes jPatro et all |20 lib 's method in that it handles an 
arbitrary number of species and provides an exact solution to a less 
constrained problem. 

Several methods are aimed at building ancestral chromosomes 
(which can be seen as relationships between genes). Most of these 
methods, however, ignore duplications and losses and are limited 
to gene famili es which have exactly one representative in each 
studied species ( Aleksevev and Pevznei J2009l;lcnauve and Tanniei , 



2008I : IChauve et all l2010l : iMa et a/ll2006l : lOuangraoua et al 



201 II) . The number of such gene families becomes smaller and 
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smaller as the number of species grows. Som e methods take as 
input gene trees allow ing duplications and losses (lLaioie a/ll2010l : 
iMuffato et all l2010h but do not model these events and treat them 
as noise that is removed for the construction of chromosomes b y 
traveling salesman-l i ke opt imiz ation methods. 
( lOuangraoua et all l201ll) or jzheng and Sankoffl l201ll) model 
duplications only in the context of whole genome duplications. 

Here, we propose a method that takes a species tree and a set 
of gene trees as inputs, and models the gain and breakage of gene 
adjacencies along a pair of trees, taking duplications and losses into 
account. We consider two genes to be 'adjacent' if they are on the 
same chromosome in the same genome and no other gene is located 
between the two. We give an exact polynomial algorithm which 
minimizes the number of gains and breakages of adjacencies, or 
more generally, the gain/breakage cost of an evolutionary scenario 
for gene adjacencies. The result consists of sets of 'adjacency trees', 
which are phylogenetic trees describing the evolution of a family of 
homologous adjacencies (adjacencies that share a common ancestor 
and derived from it). 

We assume that adjacencies evolve independently from each 
other, so we do not model the rearrangement explicitly (inversions, 
translocations etc.), but model their effect on adjacencies, which 
thus can undergo gains and breakages. 

Doing this, we solv e a problem that fits in the method ological 
program started by ISankoff and El-Mabrouki 120001) . which 
mixes rearrangements and reconciliations of phylogenetic trees (a 
reconciliation is an annotation of gene tree nodes by duplication or 
speciation events, according to a species tree). 

Algorithmically, the dynamic programming principle we use 
generalizes the Sankoff-Fitch (iFitchL 1 197 it ISankofA 1 19751) 



© The Author(s) 2012. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.0rg/licenses/by/3.O), which 
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 



Evolution of gene neighborhoods 



parsimony algorithms on binary alphabets. Indeed, when there are no 
duplications nor losses in gene trees, adjacencies may be described 
by a binary character (presenc e or absence in a genom e) evolving 
along the sp ecies tree, as in (iTang and Wangj 120051) or ( Feiiao 
and Meidanis. 1201 ll) . In our case, there is also a binary character 
(presence or absence of an adjacency), but evolving along pairs of 
reconciled gene trees. 

The description of the method requires that we formally introduce 
the three kinds of trees we handle (species, gene and adjacency 
trees), as well as the definition of the optimization problem we 
propose (Section 0). We detail some properties of the solutions, 
pointing at the possible advantages and drawbacks of this approach 
in Section [3] Then in Section |4] we describe the algorithm which 
solves t he problem. Proofs are provided in the Supplementary 
Material. 

We implemented the algorithm and applied it to mammalian and 
plant genomes. We show that compared with other methods not 
explicitly modelling evolutionary events, we have more precise and 
less erroneous views at ancestral genome organization (Section 0. 
In addition, we are able to detect segmental duplications including 
several genes, and to visualize how much the modes of evolution 
are different according to the considered clades or lineages. 

2 MODEL 

All the trees in this article have one or more vertices, they are rooted 
and have maximum degree 3. A tree T induces a partial order on its 
nodes, where descendants are lower than ancestors. 

For a tree T, V(T) denotes its vertex set and L(T) is leaf set 
(vertices with no descendants, L(T) c V(T)). For a node N of T, 
T(N) is the subtree of T rooted at N. P(N) is the parent of N (it is 
defined only if N is not the root). L(N) is the set of leaves of T(N). 

We consider all trees to be annotated, which means here that each 
node N of a tree is labeled by 

• A type of event E(N). 

• A species S(N). 

The events E(N) are taken from {Extant, Spec, GDup, 
GLos, ADup, ALos, Break}. These are abbreviations for 
'Extant', 'Speciation', 'Gene duplication', 'Gene loss', 'Adjacency 
duplication', 'Adjacency loss' and 'Adjacency breakage'. Together 
with the 'Adjacency Gain' (abbreviated Gain, which never labels the 
nodes of the trees as there is exactly one gain per adjacency tree), 
they are all the evolutionary events we consider^ Note that ALos 
means the loss of an adjacency due to the concurrent loss of the two 
involved genes, while Break means the loss of an adjacency due to 
a rearrangement. In the objective function we only take Gain and 
Break into account, so both are given a cost C(Gain) and C(Break). 

All trees depend on a set of extant genomes, which are disjoint 
sets of genes plus binary relations on these sets of genes called 
'adjacencies'. The two genes of an adjacency are called its 
'extremities'. There are three types of trees (illustrated in Fig. [TJ, 
which have the following properties. 

(1) A species tree T$ describes the diversification of species. It is 
binary, and verifies E(N) = Spec for all internal node N and 

: Even if 'Extant' is not an evolutionary event, it is included because it 
annotates some tree leaves. 




Fig. 1. Examples of a species tree (left), two gene trees (middle) and an 
adjacency tree (right). Blue dots are speciation nodes. Leaves are extant 
(species, genes, adjacencies), except the one labeled by a red cross (gene loss) 
or a red flash (breakage). Green squares are (gene or adjacency) duplication 
nodes. Gene labels refer to the species they belong to. Every node of the 
adjacency tree is labeled by a couple of nodes from gene trees 

E(N) = Extant for all leaves N. All S(N) are distinct species 
and if N is an internal node, S(N) defines an ancestral species. 

(2) A gene tree Tq describes the evolution of a family of 
homologous genes along a species tree T$. All gene trees 
here are 'recon ciled' with the 'LCA (L ast Common Ancestor) 
reconciliation' dGoodman et all 1 19791) where all gene losses 
are represented by leaves, which means every node N verifies: 

• If N is a leaf, then E(N) e {Extant, GLos], and if N is an 
internal node, then E(N) e {Spec, GDup}. 

• If E(N) = Extant, then there is a gene G(N) that belongs 
to S(N) and all such genes are distinct. 

• If E(N) = GDup then the children Nl and N2 of N are 
such that S(N\) = S(N2) = S(N) 

• If E(N) = Spec then the children Nl and N2 of N are 
such that there are two edges AAl and AA2 of T$ 
such that P(Al) = P(A2)=A and S(N) = S(A), S(Al) = 
S(Nl),<mdS(A2) = S(N2). 

• Let C be the set of leaves of T G (N); Let S(C) be the 
set of all extant species which are descendants of some 
S(l), leC; Let now N$ be the lowest node in T$ such 
that 5(£)cu ZeL( V5 )5(0; Then, S(N) = S(N S ). 

(3) An adjacency tree T^ describes the descent pattern of 
adjacencies. As adjacencies are pairs of genes, they follow 
the evolution of genes: if an adjacency AB descents from 
an adjacency CD, then A descents from C and B from D. 
So adjacency trees are defined given a set of reconciliated 
gene trees Tq and have to follow their LCA reconciliations. 
Formally, every node N of an adjacency tree verifies 

• If N is a leaf, then E(N)e {Extant, GLos, ALos, Break] 
and if N is an internal node, then E(N)e 
{Spec, GDup, ADup} 

• If E(N)^Break, then there is a couple A(N)=XY of 
gene tree nodes X and Y (possibly from two different 
gene trees) such that S(N) = S(X) = S(Y). 

• If E(N) = Extant, then G(X)G(Y) is an adjacency. 

• If E(N) = GLos, then E(X) = GLos or E(Y) = GLos (and 
not both). 

• If E(N) =ALos, then E(X)=E(Y) = GLos. 

• If E(N) = Spec, then E(X)=E(Y)=E(N). In addition, 
N has two children N 1 and N2 and either E(N l) = Break 
(respectively, E(N2) = Break)) or A(Nl) (respectively, 
A(N2)) is a couple of children of X and Y. 
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• If E(N) =ADup, then E(X) = E(Y) = GDup. In addition, 
N has two children Nl and N2 either E(N1) = Break 
(respectively, E(N2) = Break)) or A(Nl) (respectively, 
A(N2)) is a couple of children of X and Y. 

• If E(N) = GDup, then E(X) = GDup or E(Y) = GDup 
(suppose it is Y). In addition, Af has only one child N\ 
and either E{N\) = Break or A(Afl) is a couple of genes 
composed of X and one child of Y. 

An 'adjacency forest' is a set of adjacency trees, such that for two 
nodes N\ and N2 in this forest, A(N\) ^=A(N2), and such that for 
each adjacency A from any species, there exists a leaf L in the forest, 
which verifies A(L)=A. 

The cost of an adjacency tree T A , is 

C(T A ) = | {v e V (T A ), E(v) = Break} | * CiBreak) + Gain(T A ), 

where Gain(T A ) is computed in this way: if the root R of T A is such 
that A(R)=XY and either 

• P(X) = P(7)or 

• X is the root of a gene tree, and either Y is also a root, or 
S(P(Y))^S(Y) 

then Gain(T A ) = 0, else Gain(T A ) = C(Gain). The cases where 
Gain(T A ) = 0 are those arising from tandem duplications or those 
where the adjacency can have been gained earlier in the evolution. 

The cost of an adjacency forest is the sum of the costs of all 
adjacency trees. 

The problem we address is to take as input a species tree, a set 
of gene trees and a set of extant adjacencies, and to compute an 
adjacency forest of minimum cost. We give a polynomial algorithm 
which gives one optimal solution. 

3 PROPERTIES 

3.1 The cost of a duplication or loss event 

The optimization focuses only on breaks and gains of adjacencies. 
The dynamic programming technique we use does not allow to count 
duplication and loss events in the objective function. This is because 
we make the hypothesis of independent evolution of couples of 
genes, and as long as one gene has its own events and belongs 
to several couples, this independence is broken. 

Nevertheless, duplication events have an importance for the 
solutions. The duplication of an adjacency has the same cost as 
the independent duplication of two genes, but the events can still 
be discriminated because the two do not have the same effect: the 
independent duplications propagate only one adjacency, and the joint 
duplication propagates two. It is thus possible to catch the places 
where a joint duplication is advantageous in terms of gains and 
breaks. 

3.2 The linearity of genomes 

In extant genomes, one gene can participate in at most two 
adjacencies. We have not required this property in the input of the 
program because it is not used, and in this way we could easily adapt 
the problem to other kinds of relationships. The drawback of this is 
that there is no need that in ancestral genomes, genes participate to 
at most two adjacencies. 



(iFeiiao and MeidanTsl 1201 ll) prove that in a duplication-free 
framework, where Fitch's algorithm is applied on the presence and 
absence of adjacencies, choosing the absence whenever there is a 
choice to make ensures that the resulting genomes are linear. When 
there are duplications, it is not necessarily the case and it can be 
seen in the data, where some conflicts remain. But as we will see 
in the last section, the amount of conflict is reduced compared with 
other kinds of algorithms, and can be used to assess the quality of 
the gene trees, as well as the quality of the model. 

3.3 The chronology of duplications 

No c hronology of duplications is required in the input as in fPi nnev 
et al. J2007I) . foutkowski and TiurvnLl2009h or ( iMa et a/.Ll2008l) . But 
a chronology can be derived from the output. Indeed an adjacency 
duplication means that two genes are duplicated together, while two 
nodes of an adjacency tree such that one is the descendant of the 
other and both are gene duplication events define a directed relation 
between the two duplications, even if they are not comparable from 
the gene trees (not in the same tree or not comparable in one tree). But 
this relation is not necessarily an order relation. There are examples 
where temporal relationships defined by adjacency trees contradict 
the partial order of the nodes of one gen e tree: see such an e xample in 
Figure SI in [Supplementary Material| jPatro et fl/ll201ll) proscribe 
this kind of conflict and propose a heuristic principle to get rid of it 
when it happens. 

3.4 Tandem duplications 

Tandem duplications are special types of duplications, where the two 
duplicates are adjacent. Here, tandem duplications are not modeled 
explicitly as a different event from ordinary duplications. However, 
tandem duplications of one gene are indirectly taken into account: 
they cost zero (as the gain of an adjacency between two children of 
a duplication node is costless; see Section [2}, while a non-tandem 
duplication of one gene can cost one breakage plus two gains when 
one duplicate is inserted between two other genes. 

3.5 The orientations of the genes 

It is possible to take the orientation of the genes into account 
by duplicating each gene into two gene extremities and define 
adjacencies as relations between gene extremities instead of genes 
(the extremities of an adjacency are gene extremities in that case). 
The current implementation can be used in this way, it is just a matter 
of formatting the input. In this case, one gene extremity is supposed 
to participate in only one adjacency, and tandem duplications are 
not handled anymore, because there are no duplications of only one 
gene extremity. 

4 ALGORITHM 

4.1 Restriction to two gene trees 

We first restrict the problem to the comparison of two gene trees, 
without loss of generality. To do this, the extant adjacencies are 
clustered according to the following relation between two distinct 
adjacencies AB and CD: 

I). A and C are in the same gene tree, noted Gl, as well as B and 
D, in a tree noted G2 (the roles of A and B and of C and D 
may be exchanged). 
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2a). if Gl ^G2, then there are two nodes N\eG\,N2eG2 such 
thsLtS(N\) = S(N2) andA and C are descendants of Nl, while 
B and D are descendants of N2. 

2b). if G\ = G2, then the lowest common ancestor of A and B is 
the same node as the lowest common ancestor of C and D (it 
is necessarily a duplication node). 

This relation between adjacencies satisfying all conditions is 
an equivalence relation (reflexive, symmetric and transitive). 
Equivalence classes are treated independently. This is justified by 
the following lemma, whose proof stands in the Supplementary 
Material. 

Lemma 1. If there is a tree of adjacencies which contains 
adjacencies AB and CD, then AB and CD are in the same 
equivalence class. 

In other words, if AB and CD are not in the same class they cannot 
be homologous. The converse is not true however. Solutions for one 
class may consist of several adjacency trees. 

This clustering allows to divide the problem into equivalence 
classes, which concern one or two gene trees. If in an equivalence 
class, adjacencies have extremities in the same tree, by definition of 
the classes, there is a common ancestor to all pairs of extremities 
of adjacencies. By removing this vertex, we get two trees rooted at 
its children, and all adjacencies have one extremity in each of these 
two trees. 

So we may restrict ourselves to this case where we have exactly 
two gene trees and all adjacencies are between these two trees. 
Moreover, we may suppose that each tree is rooted at the lowest 
common ancestor of all genes involved in adjacencies of the chosen 
class, because we may simply consider the subtree rooted at this 
vertex. This yields that the two roots are necessarily assigned to the 
same species. 

4.2 Recurrence formulas 

Formally, we have two gene trees Tq and Tq, extant adjacencies 
have one extremity in each tree, and if and R 2 are the respective 
roots of T l Q and T 2 , then S(R X ) = S(R 2 ). 

For a pair of nodes (v 1 , v 2 ) e V(T^) x V (T 2 ) such that ^(v 1 ) = 
S(v 2 ), we compute two values, c\{v^ ,v 2 ) and cqO^v 2 ) by 
recurrence formulas described in the sequel. Remark that we only 
consider pairs of nodes annotated with the same species because an 
adjacency is always linking genes from the same genome. We prove 
that these numbers have the following properties (proofs are in the 
[Supplementary Material! Appendix 2). 

Theorem 1. • ci (v 1 ,v 2 ) is the minimum cost of an adjacency 
forest F for the adjacencies between two gene trees Tq(v^) 
and Tq(v 2 ), such that there is a node N in F withA(N) = v 1 v 2 . 

• cq(v^,v 2 ) is the minimum cost of an adjacency forest F for 



the adjacencies between two gene trees Tq(v^) and Tq(v 2 ), 
such that there is no node N in F with A(N) - 



= vV. 



In consequence, the minimum cost of an adjacency forest will 
be given by computing the minimum between ci(R^,R 2 ) and 
c 0 (R\R 2 ). 



The recurrence for the computation of c\{v^ ,v 2 ) and q^v^v 2 ) 
follows a case analysis, according to the type of event associated to 
v 1 and v 2 . The roles of v 1 and v 2 are symmetrical. We note ca(v) 
and cb(v) the two children of a node v. 



Case 1. E(v 1 ) = Extant and E(v 2 ) = Extant. 

If v*v 2 is an adjacency then ci(v 1 ,v 2 ) = 0 and cq(v^, v 2 ) = oo; 

else c\ (v 1 , v 2 ) = oo and cqCv 1 , v 2 ) = 0. 



Case 2. E(y l ) = GLos and E(y 2 ) ^ GLos. 
In this case c^v 1 , v 2 ) = 0 and cq(v 1 ,v 2 ) = 0. 



Case 3. E(y l ) = GLos and E(v 2 ) = GLos. 
In this case ci(v 1 ,v 2 ) = 0 and cq(v 1 ,v 2 ) = 0. This case has to be 
distinguished from the previous one for the backtracking procedure 
described in the following subsection. 



Case 4. E(v l ) e {Extant, Spec] and E(v 2 ) = GDup. 

\cb(v 2 )) 

c 1 (v 1 ,v 2 ) = min| ^0(v;,ca(v-)) + c l( vj,^(v2)) 

1 ^,.2^ , , c b(v 2 )) + C(Gain) 

v2 



co(v 1 ,v 2 ) = min 



' ciO 1 


,ca{v 2 )) + CQ{v X 


co(v ! 


^(v^ + ciO 1 


ciO 1 


,ca(v 2 )) + ci(v l 


. coO 1 


,ca(v 2 )) + c 0 (v l 


coO 1 


,cci{v 2 )) + cq(v 1 


coO 1 


,ca{v 2 )) + c\(v l 


ciO 1 


,ca(v 2 )) + CQ(v l 


. ciO 1 


,ca{v 2 )) + ci{v l 



,cb(v l )) + C(Break) 
,cb(v 2 )) 

,cb(v 2 )) + C(Gain) 
,cb(v 2 )) + C(Gain) 
,cb(v 2 )) + 2*C(Gain) 



Case 5. E(v 1 ) = Spec and E(v 2 ) = Spec. 
Suppose without loss of generality that ^(^(v 1 ))^^^^ 2 )) 
and S(cb(v 1 )) = S(cb(v 2 )). 



c 1 (v 1 ,v 2 )= 



c 0 (v 1 ,v 2 )= 



c\(ca(v 
c\(ca(v 
c 0 (co(v 
c 0 (ca(v 
c 0 (ca(v 
c\(ca(v 
c 0 (ca(v 
c\(ca(v 



),ca(v 2 )) + c l (cb(v 1 ),cb(v 2 )) 
) , caiy 2 )) + c 0 (cb(v 1 ) , cbiy 2 )) + C (Break) 
),ca(v 2 )) + ci(cb(v l ),cb(v 2 )) + C(Break) 
) , caiy 2 )) + c 0 (cbiy 1 ) , cbiy 2 )) + 2 * C (Break) 
),ca(v 2 )) + c 0 (cb(v l ),cb(v 2 )) 
),ca(v 2 )) + c 0 (cb(v l ), cb(v 2 )) + C(Gain) 
), ca(v 2 )) + C1 (cb(v l ), cb(v 2 )) + C(Gain) 
), ca(v 2 )) +ci (cb(v l ), cb(v 2 )) + 2*C(Gain) 



Case 6. ^(v 1 ) = GDup and E(v z ) = GDup. 

In this case c\(v^, v 2 ) = min(D I, D2,D 12) where 

Dl is the cost in the case the v 1 duplication comes first, 

D2 is the cost in the case the v 2 duplication comes first, 

D\2 is the cost in the case where the v 1 and v 2 duplications are 

simultaneous. 

ici(ca(v l ),v 2 ) + c 0 (cb(v l ),v 2 ) 
c 0 (ca(v l ),v 2 ) + ci(cb(v l ),v 2 ) 
ci (ca(v l ), v 2 ) + Cl (cb(v l ), v 2 ) + C(Gain) 
c 0 (ca(v 1 ) , v 2 ) + c 0 (cb(v 1 ) , v 2 ) + C (Break) 



Dl 



D2 = min 



D12 = 



cxO 1 ,ca(v 2 )) + ci(v l ,cb(v 2 )) + C(Gain) 
c 0 (v 1 , ca(v 2 )) + c 0 (v 1 , cb(v 2 )) + C (Break) 
min (over all 16 following cases) 
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c\(cb(v x ),cb(v 2 )) + 



(3) 



(5) 



(6) 



(7) 



c l (ca(v 1 ),ca(v 2 )) + c\(cb(v l ),cb(v 2 )) + 

CQ(ca(v l ),cb(v 2 )) + ci(cb(v 1 ),ca(v 2 )) + 
C(Gain) 

ci(ca(v l ),ca(v 2 )) + c\(cb(v l ),cb(v 2 )) + 

ci(ca(v l ),cb(v 2 )) + c 0 (cb(v l ),ca(v 2 )) + 
C(Gain) 

(4) ci(ca(v l ),ca(v 2 )) + Cl (cb(v l ),cb(v 2 )) + 

^(c^v 1 ),^ 2 )) + c^cKv^^a^ 2 )) + 
2* C '(Gain) 

c\(ca(v l ),ca(v 2 )) + cq^v^c^v 2 )) + 

^(^(v 1 ),^ 2 )) + c Q (cb(v l ),ca(v 2 )) + 
C(flrea£) 

c^ca^Wv 2 )) + co^v 1 ),^ 2 )) + 

c 0 (ca(v l ),cb(v 2 )) + c x (cb(v l ),ca(v 2 )) + 

C(Gain) + C(Break) 

c x (ca(v l ),ca(v 2 )) + c Q (cb(v l ),cb(v 2 )) + 

c\(ca(v l ),cb(v 2 )) + c 0 (cb(v 1 ),ca(v 2 )) + 

C(Gain) + C(Break) 

(8) c 0 (ca(v l ),ca(v 2 )) + c^v 1 ),^ 2 )) + 
c 0 (ca(v l ),cb(v 2 )) + c 0 (cb(v 1 ),ca(v 2 )) + 

(9) c 0 {ca{v l ),ca{v 2 )) + ^(^(v 1 ),^ 2 )) + 
CQ(ca(v l ),cb(v 2 )) + c x (cb(v l ),ca(v 2 )) + 
C(Gain) + C(Break) 

(10) coMv^catv 2 )) + ci(cb(v l ),cb(v 2 )) + 
c^caCv 1 ),^ 2 )) + coWv 1 ),^^ 2 )) + 
C(Gain) + C{Break) 

(11) coCoVWv 2 )) + co^v 1 ),^ 2 )) + 
c^caCv 1 ),^ 2 )) + c\(cb(v 1 ),ca(v 2 )) 

(12) qKca^Wv 2 )) + ci{cb{v l \cb{v 2 )) + 
^(c^v 1 ),^ 2 )) + ^(cMv 1 ),^ 2 )) + 
C(Gain) 

(13) c^ca^ 1 ),^ 2 )) + c 0 (cb(v l ),cb(v 2 )) + 
c^ca^ 1 ),^ 2 )) + c^c^v^^a^ 2 )) + 
C(Gain) 

(14) coCcflCv 1 ),^ 2 )) + c 0 {cb{v l \cb{v 2 )) + 
c^ca^ 1 ),^ 2 )) + co^vV^v 2 )) + 
C(Break) 

(15) ^(^(v^^^v 2 )) + co^v^.^v 2 )) + 
^(ca^ 1 ),^ 2 )) + c\(cb(v l ),ca(v 2 )) + 
C(Break) 

(16) c Q (ca(v l ),ca(v 2 )) + co^v 1 ),^ 2 )) + 
c 0 (ca(v l ),cb(v 2 )) + c 0 (cb(v l ),ca(v 2 )) + 
2* C (Break) 

Cases where the v 1 duplication comes first 



CQ(ca(v l ), v 2 ) + c\ (cb(v l ), v 2 ) + C(Gain) 
ci (ca(v l ), v 2 ) + c 0 (cb(v l ), v 2 ) + C(Gain) 
c\ (ca(v l ), v 2 ) + c\ (cb(v l ), v 2 ) + 2* C(Gain) 
Cases where the v 2 duplication comes first 



cqCv 1 ,v 2 ) = min 



c 0 (v l ,ca(v 2 )) + c 0 (v l ,cb(v 2 )) 
c 0 (v\ ca(v 2 )) + ci (v 1 , cb(v 2 )) + C(Gain) 



. c\ (v 1 , ca(v 2 )) + C! (v 1 , cb(v 2 )) + 2* C(Gain) 

We do not examine the case E(v^) = Extant and E(v 2 ) = Spec 
because in this case ^(v 1 ) ^^(v 2 ). 

The algorithm implements these recurrence formulas in an 
iterative way following a dynamic programming technique, by 
computing the costs in a post-order traversal of the couples of tree 
nodes. 

4.3 Backtracking 

The recurrence formulas allow the construction of a cost matrix 
which rows are the nodes of the first gene tree, and columns are the 
nodes of the second gene tree. The nodes of the adjacency forest 
are constructed while backtracking on the cost matrix starting at 
min(ci (R 1 ,R 2 ), coiR 1 ,R 2 )). The backtracking procedure classically 
follows each cost on the chosen line in the recurrence formulas, 



creating adjacency trees from root to leaves. A node N with A(N) = 
v*v 2 is created each time c\(v^,v 2 ) is chosen. The event labeling 
this node depends on the events labeling v 1 and v 2 : Extant for 
Case 1, GLos for Case 2, ALos for Case 3, GDup for Cases 4 and 6. 
(D1&D2), Spec for Case 5 and ADup for Case 6 (D12). A node N 
with E(N)= Break is created each time there is a C (Break) in the 
chosen formula. 

Edges between the nodes follow the pattern of descent between 
adjacencies: 

• Break nodes are leaves, and their parent are the nodes 
constructed in the formula where C (Break) occurs; 

• In Cases 4 and 6 (D1&D2), there is an edge between v 1 v 2 and 
one of v^ca(v 2 ), v^cb(v 2 ), ca(v^)v 2 , cb(v^)v 2 , if c\ is chosen 
for either of them. 

• In Cases 5 and 6 (D12), there is an edge between v*v 2 
and one or two of ca(v^)ca(v 2 ), cb(v^)cb(v 2 ), cb(v^)ca(v 2 ), 
ca(v^)cb(v 2 ) if c\ is chosen for either of them (there can be 
arbitrary choices for equivalent solutions). 

Recurrence formulas imply that the backtracking procedure does 
not create twice the same node: each formula computes the cost for 

v 1 v 2 between pairs of nodes where at least one is a descendant of 

1 2 
v 1 or v . 

An example of an algorithm input and output is drawn on Figure[2] 
4.4 Complexity 

The algorithm takes as input a dataset composed by a species tree, 
several gene trees and a list of adjacencies. It first computes the 
equivalence classes of adjacencies. Then for each class it constructs 
two subtrees to compute cq and c\ costs on their roots and applies 
the backtracking procedure. The algorithm outputs the adjacency 
forest resulting from the union of all adjacency forests built on each 
class. 

Let n be the number of gene trees and k be the maximum size 
of a tree. The algorithm runs in 0(n 2 x k 2 ). Indeed, the maximum 
number of adjacency equivalence classes is bounded by 0(n 2 ), while 
for each equivalence class, every couple of node is examined with 
a constant- time case analysis. 

In practice, the number of equivalence classes is much closer to 
n than to n 2 and trees are small compared with the total number of 
genes. For all datasets we tested, including dozens of species and 
thousand of genes, the execution time was under 10 min. 



5 IMPLEMENTATION AND APPLICATION 

We i mplemented the algorithm using the Bio++ platform ( Dutheil 
etal hOOdb " The program, named DeCo (Detection of Co-evolution 
or DeCoration of trees), takes as input a species tree, a set of genes 
along with the species they are in, a set of adjacencies and a set of 
gene trees. 

We tested it on four datasets, with costs C(Gain) = C(Break) = 1. 
The first and second datasets are based on 5039 gene trees from the 
Ensembl database (release 57) restricted to mammalian assembled 
genomes (11 species) 0 The first set of trees are those provided in this 

2 The parsimony framework practically makes it necessary to work with only 
assembled genomes, since we would count to many breaks for unassembled 
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Evolution of gene neighborhoods 



Input 



Output 



A B 
S 




Adjacency list : (Ai A 3 , B1B3, A 2 A A , B 2 B 4 ) 



/X-C 4 c 7 



AiA, 



A2A 4 B2B 4 ^1^3 



Fig. 2. Example of the application of the algorithm on two genes trees, G\ 
and G2, a species tree S and an adjacency list shown on the line Input. 
The costs are C(Gain) = C (Break) = 1 . All the costs Cb(Ei,Ej) are computed 
for beO,l, EeA,B,C, *je[1..8], with E t in G { and Ej in G 2 . As a result 
cq(Cs,C%) = 2 while c\(Cs,C%)=\. Therefore, the adjacency forest on the 
line Output contains CsC%. The left tree has cost 0 while the right one costs 
C(Gain) = 1 for the gain of the adjacency B 1B3 



datab ase, made according to the TreeBeST pipeline jVilella et all 
120091) . T he second consists of the trees reconstructed by the PhylDog 
method feoussau et a/.Ll2Q12h . with an explicit model of duplication 
and losses of trees . Both sets of trees wer e reconciled according to 
the LCA method dGoodman et all Il979h , which gives gene trees 
with the properties written in Section [2} . 

Then, we computed ancestral adjacencies according to the method 
described here, and compared with the 'pa irwise alternative', an 
imple m entation of the princ ip les used bv iChauve and TannierL 
12008b iMuffato et a/.lEoioL dBertrand et g/.ll201Cb or ( Boussau 
et a/. j2012l) , in which adjacencies are constructed bv comparing 
couples of species (the method is described in the Supplementary 
Material, Appendix 3) instead of all genomes together. 

We computed the degree of each ancestral gene, that is, the number 
of adjacencies which has it as an extremity. As shown in Figure [3] 
most ancestral genes have degree 2, which means the signal of 
linearity of the ancestral genomes is recovered. We can observe 
the gain obtained by using PhylDog trees instead of TreeBeST trees 
(red plain versus blue dotted line), and the gain obtained by using 
DeCo instead of the pairwise alternative (red plain versus green 
dashed line). These two gains are nearly equivalent, showing that 
to get better ancestral genomes, we need good trees as well as good 
adjacency inference algorithms. Better trees tend to give a better 
estimate of the ancestral gene content, minimizing the degree 0 
(probably wrong) genes, while the adjacency inference algorithm 
may minimize the number of genes with degree >2: convergent 
evolution can yield false ancestral adjacencies, which add to the 
two true ones. Convergent evolution is impossible to handle in a 
pairwise method. 

The third and fourth datasets are constructed from the Ensembl 
(release 65) and EnsemblPlant (release 12) databases, restricted 
to some assembled mammalian (11 species, 19217 gene trees 

ones, preventing the reconstruction of some ancestral adjacencies. It is 
possible to envisage branch specific costs, where unassembled genomes 
would have low breakage cost, and then could be used in a dataset. 



o 

Q. 

o 

CL 




PhylDog-DeCo 
— PhylDog-pairwise 
• • TreeBeST-DeCo 



0 1 2 3 4 5 6 7 

Number of neighbors 

Fig. 3. Proportion of genes having k neighbors, function of k. Red plain 
line is obtained with DeCo and PhylDog trees. Green dashed line is obtained 
with PhylDog trees and the pairwise alternative. Blue dotted line is obtained 
with TreeBeST trees and DeCo 

with an average of 10 genes) and angiosperm (9 genomes, 35 182 
gene trees, with an average of 9 genes) species. We chose these 
two clades for a phylogenetic comparison because the estiated 
divergence times are similar, and there are approximatively as many 
assembled genomes in both. We compared the number of segmental 
duplications involving more than one gene in these two datasets. 
In Figure [4] phylogenetic trees of mammals and angiosperms are 
drawn, in which branch length is the number of pairs of genes 
duplicated together over the total number of ancestral genes found in 
the same branch. We find that on average branch lengths are more 
than three times longer in plants, indicating genome architectures 
rapidly evolving compared with slow mammalian ones. Angiosperm 
genomes have been shaped by several whole genome duplications: 
at the basis of monocots, a triplication at the basis of dicots, plus one 
event on the Maize and Poplar lineages, and two on the Arabidopsis 
one. These events probably create a long branch in Poplar, or 
Glycine, but are not always visible (e.g. in Arabidopsis) due to 
differentiated losses which blurred the synteny signal. The difference 
in branch length can partly be due to whole genome duplications. But 
measuring the average size of the duplicated segments by computing 

#GDup 



#GDup — #Adup 

we found no significant difference between the two phyla (=1.08 
on average among all branches for both), indicating that the changes 
in genome architectures following a whole genome duplication are 
not fully accessible to this method. The long branch at the basis of 
eutheria would deserve more studies to know to which extend it is 
artif actual and due to the quality of gene trees. 



6 PERSPECTIVES 

The algorithm can easily be extended to handle other relations 
than adjacencies (interactions, regulations, co-expression or any 
functional relation which can evolve by gain or breakage like 
adjacencies). It can be seen as even more adapted to less constrained 
relations (without a linear organization). Indeed, if a gene is lost, 
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Fig. 4. Angiosperm and mammalian phylogenies, where branch lengths are 
proportional to the number of adjacency duplications normalized by the 
number of genes. The scale is indicated at the bottom left of the two figures 

no adjacency is automatically and freely gained between its two 
neighbors in this model. But the computation time should be higher 
for other relations, as the possible number of relations is a quadratic 
function of the number of genes, while the number of adjacencies 
only grows linearly. 

Po ssible extensions can be to include transfers foovon et all 
l201ll) , in complete lineage sorting or gene conversion ( Rasmus sen 
andKellis l2012l) to the possible events. And also to allow other types 
of reconciliations than the LCA one. More flexible cost functions for 
duplications may also be desirable, but in this case the independent 
evolution between adjacencies is lost, and the use of dynamic 
programming does not seem generalizable. 
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